STAT 301 Project: Stage 2¶

Group 3: Helena Sokolovska¶

In [1]:
# install.packages("ggtern")
# install.packages("ggcorrplot")
# install.packages("cowplot")
# install.packages("patchwork")
# install.packages("gridExtra")
# install.packages("png")
# install.packages("grid")
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("nnet")
# install.packages("broom")

library(tidyverse)
library(ggtern)
library(dplyr)
library(ggcorrplot)
library(cowplot)
library(patchwork)
library(gridExtra)
library(png)
library(grid)
library(nnet)
library(broom)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Registered S3 methods overwritten by 'ggtern':
  method           from   
  grid.draw.ggplot ggplot2
  plot.ggplot      ggplot2
  print.ggplot     ggplot2

--
Remember to cite, run citation(package = 'ggtern') for further info.
--


Attaching package: ‘ggtern’


The following objects are masked from ‘package:ggplot2’:

    aes, annotate, ggplot, ggplot_build, ggplot_gtable, ggplotGrob,
    ggsave, layer_data, theme_bw, theme_classic, theme_dark,
    theme_gray, theme_light, theme_linedraw, theme_minimal, theme_void



Attaching package: ‘cowplot’


The following object is masked from ‘package:lubridate’:

    stamp



Attaching package: ‘patchwork’


The following object is masked from ‘package:cowplot’:

    align_plots



Attaching package: ‘gridExtra’


The following objects are masked from ‘package:ggtern’:

    arrangeGrob, grid.arrange


The following object is masked from ‘package:dplyr’:

    combine


0. TA Feedback¶

Stage 1: 30/30

Section 1: Pre-selection - only variables that definitely won’t be useful (e.g. the redundant ones; the ID column) should be dropped in this stage. This pre-selection happens before you form your scientific question.

Section 3: The correlation plot would be clearer with numeric labels on the cells. Interpretation: Max 2-3 sentences for each point.

1. Data Description¶

  • The Customer Personality Analysis dataset was provided by Dr. Omar Romero-Hernandez and published by user Akash Patel on Kaggle.
  • This is a detailed collection of a company's customer history, with customer demographics, behavior, products purchased, and route (channel) of purchase.
  • The dataset recorded 29 variables for 2240 customers, though it is not known how this data was sampled or collected.
Variable Description Type
Demographics
ID Customer's unique identifier Numeric
Year_Birth Customer's birth year Numeric: Temporal
Education Customer's education level Categorical: Ordinal
Marital_Status Customer's marital status Categorical: Nominal
Income Customer's yearly household income Numeric
Kidhome Number of children in customer's household Numeric
Teenhome Number of teenagers in customer's household Numeric
Behavior
Dt_Customer Date of customer's enrollment with the company Numeric: Temporal
Recency Number of days since customer's last purchase Numeric
NumWebVisitsMonth Number of visits to company’s website in the last month Numeric
NumDealsPurchases
Number of purchases made with a discount Numeric
AcceptedCmp1 1 if customer accepted the offer in the 1st campaign, 0 otherwise Categorical: Binary
AcceptedCmp2 1 if customer accepted the offer in the 2nd campaign, 0 otherwise Categorical: Binary
AcceptedCmp3 1 if customer accepted the offer in the 3rd campaign, 0 otherwise Categorical: Binary
AcceptedCmp4 1 if customer accepted the offer in the 4th campaign, 0 otherwise Categorical: Binary
AcceptedCmp5 1 if customer accepted the offer in the 5th campaign, 0 otherwise Categorical: Binary
Response 1 if customer accepted the offer in the last campaign, 0 otherwise Categorical: Binary
Complain 1 if the customer complained in the last 2 years, 0 otherwise Categorical: Binary
Z_CostContact Unclear: all observations are 3 Numeric
Z_Revenue Unclear: all observations are 11 Numeric
Products
MntWines Amount spent on wine in last 2 years Numeric
MntFruits Amount spent on fruits in last 2 years Numeric
MntMeatProducts Amount spent on meat in last 2 years Numeric
MntFishProducts Amount spent on fish in last 2 years Numeric
MntSweetProducts Amount spent on sweets in last 2 years Numeric
MntGoldProds Amount spent on gold in last 2 years Numeric
Channel
NumWebPurchases Number of purchases made through the company’s website Numeric
NumCatalogPurchases Number of purchases made using a catalogue Numeric
NumStorePurchases Number of purchases made directly in stores Numeric

Pre-Selection of Variables¶

In [2]:
customers <- read_tsv("marketing_campaign.csv")
head(customers)
Rows: 2240 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (3): Education, Marital_Status, Dt_Customer
dbl (26): ID, Year_Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntF...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 29
IDYear_BirthEducationMarital_StatusIncomeKidhomeTeenhomeDt_CustomerRecencyMntWines⋯NumWebVisitsMonthAcceptedCmp3AcceptedCmp4AcceptedCmp5AcceptedCmp1AcceptedCmp2ComplainZ_CostContactZ_RevenueResponse
<dbl><dbl><chr><chr><dbl><dbl><dbl><chr><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
55241957GraduationSingle 581380004-09-201258635⋯70000003111
21741954GraduationSingle 463441108-03-201438 11⋯50000003110
41411965GraduationTogether716130021-08-201326426⋯40000003110
61821984GraduationTogether266461010-02-201426 11⋯60000003110
53241981PhD Married 582931019-01-201494173⋯50000003110
74461967Master Together625130109-09-201316520⋯60000003110
In [3]:
sapply(customers, function(x) length(unique(x)))
ID
2240
Year_Birth
59
Education
5
Marital_Status
8
Income
1975
Kidhome
3
Teenhome
3
Dt_Customer
663
Recency
100
MntWines
776
MntFruits
158
MntMeatProducts
558
MntFishProducts
182
MntSweetProducts
177
MntGoldProds
213
NumDealsPurchases
15
NumWebPurchases
15
NumCatalogPurchases
14
NumStorePurchases
14
NumWebVisitsMonth
16
AcceptedCmp3
2
AcceptedCmp4
2
AcceptedCmp5
2
AcceptedCmp1
2
AcceptedCmp2
2
Complain
2
Z_CostContact
1
Z_Revenue
1
Response
2
  • We can confirm above that all the observations in Z_CostContact and Z_Revenue are the same value.

Final Selection¶

  • We will remove Z_CostContact and Z_Revenue from the dataset, since their meaning is not clear and our analysis (below) shows that all observations have the same value.
  • We will remove Products variables since they are not relevant to my research question.
  • We will remove ID, since this column is not useful for analysis.
In [4]:
customers <- customers %>%
    select(-Z_CostContact, -Z_Revenue, -ID)

2. Scientific Question¶

What is the association between customer traits (demographic and behavioral factors outlined above) and their preferred shopping channel (online, catalog, or store)?¶

  • Model: multinomial logistic regression.
  • Response: customer's preferred shopping channel (online, catalog, or store).
  • Inference: extracting the association between customer traits and their preferred shopping channel.

3. Exploratory Data Analysis and Visualization¶

In [5]:
head(customers)
summary(customers)
A tibble: 6 × 26
Year_BirthEducationMarital_StatusIncomeKidhomeTeenhomeDt_CustomerRecencyMntWinesMntFruits⋯NumCatalogPurchasesNumStorePurchasesNumWebVisitsMonthAcceptedCmp3AcceptedCmp4AcceptedCmp5AcceptedCmp1AcceptedCmp2ComplainResponse
<dbl><chr><chr><dbl><dbl><dbl><chr><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1957GraduationSingle 581380004-09-20125863588⋯10 470000001
1954GraduationSingle 463441108-03-201438 11 1⋯ 1 250000000
1965GraduationTogether716130021-08-20132642649⋯ 21040000000
1984GraduationTogether266461010-02-201426 11 4⋯ 0 460000000
1981PhD Married 582931019-01-20149417343⋯ 3 650000000
1967Master Together625130109-09-20131652042⋯ 41060000000
   Year_Birth    Education         Marital_Status         Income      
 Min.   :1893   Length:2240        Length:2240        Min.   :  1730  
 1st Qu.:1959   Class :character   Class :character   1st Qu.: 35303  
 Median :1970   Mode  :character   Mode  :character   Median : 51382  
 Mean   :1969                                         Mean   : 52247  
 3rd Qu.:1977                                         3rd Qu.: 68522  
 Max.   :1996                                         Max.   :666666  
                                                      NA's   :24      
    Kidhome          Teenhome      Dt_Customer           Recency     
 Min.   :0.0000   Min.   :0.0000   Length:2240        Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.:0.0000   Class :character   1st Qu.:24.00  
 Median :0.0000   Median :0.0000   Mode  :character   Median :49.00  
 Mean   :0.4442   Mean   :0.5062                      Mean   :49.11  
 3rd Qu.:1.0000   3rd Qu.:1.0000                      3rd Qu.:74.00  
 Max.   :2.0000   Max.   :2.0000                      Max.   :99.00  
                                                                     
    MntWines         MntFruits     MntMeatProducts  MntFishProducts 
 Min.   :   0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.00  
 1st Qu.:  23.75   1st Qu.:  1.0   1st Qu.:  16.0   1st Qu.:  3.00  
 Median : 173.50   Median :  8.0   Median :  67.0   Median : 12.00  
 Mean   : 303.94   Mean   : 26.3   Mean   : 166.9   Mean   : 37.53  
 3rd Qu.: 504.25   3rd Qu.: 33.0   3rd Qu.: 232.0   3rd Qu.: 50.00  
 Max.   :1493.00   Max.   :199.0   Max.   :1725.0   Max.   :259.00  
                                                                    
 MntSweetProducts  MntGoldProds    NumDealsPurchases NumWebPurchases 
 Min.   :  0.00   Min.   :  0.00   Min.   : 0.000    Min.   : 0.000  
 1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000    1st Qu.: 2.000  
 Median :  8.00   Median : 24.00   Median : 2.000    Median : 4.000  
 Mean   : 27.06   Mean   : 44.02   Mean   : 2.325    Mean   : 4.085  
 3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000    3rd Qu.: 6.000  
 Max.   :263.00   Max.   :362.00   Max.   :15.000    Max.   :27.000  
                                                                     
 NumCatalogPurchases NumStorePurchases NumWebVisitsMonth  AcceptedCmp3    
 Min.   : 0.000      Min.   : 0.00     Min.   : 0.000    Min.   :0.00000  
 1st Qu.: 0.000      1st Qu.: 3.00     1st Qu.: 3.000    1st Qu.:0.00000  
 Median : 2.000      Median : 5.00     Median : 6.000    Median :0.00000  
 Mean   : 2.662      Mean   : 5.79     Mean   : 5.317    Mean   :0.07277  
 3rd Qu.: 4.000      3rd Qu.: 8.00     3rd Qu.: 7.000    3rd Qu.:0.00000  
 Max.   :28.000      Max.   :13.00     Max.   :20.000    Max.   :1.00000  
                                                                          
  AcceptedCmp4      AcceptedCmp5      AcceptedCmp1      AcceptedCmp2    
 Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
 Mean   :0.07455   Mean   :0.07277   Mean   :0.06429   Mean   :0.01339  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
                                                                        
    Complain           Response     
 Min.   :0.000000   Min.   :0.0000  
 1st Qu.:0.000000   1st Qu.:0.0000  
 Median :0.000000   Median :0.0000  
 Mean   :0.009375   Mean   :0.1491  
 3rd Qu.:0.000000   3rd Qu.:0.0000  
 Max.   :1.000000   Max.   :1.0000  
                                    
In [6]:
colSums(is.na(customers))
Year_Birth
0
Education
0
Marital_Status
0
Income
24
Kidhome
0
Teenhome
0
Dt_Customer
0
Recency
0
MntWines
0
MntFruits
0
MntMeatProducts
0
MntFishProducts
0
MntSweetProducts
0
MntGoldProds
0
NumDealsPurchases
0
NumWebPurchases
0
NumCatalogPurchases
0
NumStorePurchases
0
NumWebVisitsMonth
0
AcceptedCmp3
0
AcceptedCmp4
0
AcceptedCmp5
0
AcceptedCmp1
0
AcceptedCmp2
0
Complain
0
Response
0
  • There are 24 values missing from the Income column - we will need to remove these rows.
In [7]:
categoricals = select(customers, c(Education, Marital_Status))
lapply(categoricals, unique)
$Education
  1. 'Graduation'
  2. 'PhD'
  3. 'Master'
  4. 'Basic'
  5. '2n Cycle'
$Marital_Status
  1. 'Single'
  2. 'Together'
  3. 'Married'
  4. 'Divorced'
  5. 'Widow'
  6. 'Alone'
  7. 'Absurd'
  8. 'YOLO'
  • '2n Cycle' is the European equivalent of a Master's degree, so we will combine the 2 categories.
  • 'Alone' and 'Single' are synonymous, so we will combine the 2 categories.
  • 'Absurd' and 'YOLO' are "troll responses, so we will remove these rows.

Cleaning + Wrangling¶

Given the observations above, we will make the following changes to wrangle the dataset:

  • Remove products variables, since they are not relevant to our research question.
  • Convert number of purchases via each shopping channel to proportion of purchases made via each shopping channel (for logistic regression).
  • Remove observations with 'N/A' values (24 rows).
  • Recode '2n Cycle' to 'Master' in Education.
  • Recode 'Alone' to 'Single' in Marital_Status.
  • Remove observations with "troll" Marital_Status responses ('YOLO' and 'Absurd').
  • Calculate current age in 2025 from Year_Birth, for better interpretability of correlation plot.
  • Calculate customer tenure in days (as of today) from date of enrollment with company, for better interpretability of correlation plot.
  • Rename columns to more readable names for visualizations.
In [8]:
# clean dataset
customers_clean <- customers %>%
    # remove products variables
    select(-starts_with("Mnt")) %>%
    # compute proportion of total purchases
    mutate(
        TotalPurchases = NumWebPurchases + NumCatalogPurchases + NumStorePurchases,
        PropWebPurchases = NumWebPurchases / TotalPurchases,
        PropCatalogPurchases = NumCatalogPurchases / TotalPurchases,
        PropStorePurchases = NumStorePurchases / TotalPurchases
    ) %>%
    # remove observations with any N/A values
    na.omit() %>%    
    # Education: change "2n Cycle" to "Master"
    mutate(Education = recode(Education, "2n Cycle" = "Master")) %>%
    # Marital_Status: change "Alone" to "Single" 
    mutate(Marital_Status = recode(Marital_Status, "Alone" = "Single")) %>% 
    # remove observations with "troll" Marital_Status responses
    filter(!Marital_Status %in% c("YOLO", "Absurd")) %>% 
    # calculate current age from "Year_Birth" for better interpretability of correlation plot.
    mutate(Age_in_2025 = 2025 - Year_Birth) %>%
    # calculating customer tenure (as of today) from date of enrollment with company for better interpretability of correlation plot.
    mutate(Dt_Customer = as.Date(Dt_Customer, format = "%d-%m-%Y")) %>%
    mutate(Tenure_Days_Today = as.numeric(Sys.Date() - Dt_Customer)) %>%
    # rename columns to more readable names for visualizations
    rename(
        Kids_at_Home = Kidhome,
        Teens_at_Home = Teenhome,
        Date_Enrollment = Dt_Customer,
        Num_Web_Visits_Monthly = NumWebVisitsMonth,
        Num_Discount_Purchases = NumDealsPurchases,
        Accepted_Last_Campaign_Offer = Response,
        Complained_Past_2_Years = Complain
    ) %>%
    rename_with(~ gsub("^AcceptedCmp", "Accepted_Campaign_Offer_", .x), starts_with("AcceptedCmp"))

Ternary Plots¶

  • I will generate ternary plots of proportion of purchases via each shopping channel and colour the points by variables with few categories.
In [9]:
sapply(customers_clean, function(x) length(unique(x)))
Year_Birth
59
Education
4
Marital_Status
5
Income
1967
Kids_at_Home
3
Teens_at_Home
3
Date_Enrollment
662
Recency
100
Num_Discount_Purchases
15
NumWebPurchases
15
NumCatalogPurchases
14
NumStorePurchases
14
Num_Web_Visits_Monthly
15
Accepted_Campaign_Offer_3
2
Accepted_Campaign_Offer_4
2
Accepted_Campaign_Offer_5
2
Accepted_Campaign_Offer_1
2
Accepted_Campaign_Offer_2
2
Complained_Past_2_Years
2
Accepted_Last_Campaign_Offer
2
TotalPurchases
32
PropWebPurchases
126
PropCatalogPurchases
128
PropStorePurchases
121
Age_in_2025
59
Tenure_Days_Today
662
  • My options are:
    • Education (4 categories)
    • Marital_Status (5 categories)
    • Kids_at_Home, Teens_at_Home (binary)
    • Accepted_Campaign_Offer_1, Accepted_Campaign_Offer_2, Accepted_Campaign_Offer_3, Accepted_Campaign_Offer_4, Accepted_Campaign_Offer_5, Accepted_Last_Campaign_Offer (binary): since we do not know about the campaigns, visualizations would not be particularly informative.
    • Complained_Past_2_Years (binary): too few customers complained for any visible patterns.
  • Because I can't visualize too many ternary plots for my final visualization and based on the reasons above, I will focus on visualizing 4 demographic variables: Education, Marital_Status, Kids_at_Home, and Teens_at_Home.
In [10]:
options(repr.plot.width = 6, repr.plot.height = 6)

# function to generate ternary plot with specified variable for colour channel
plot_ternary <- function(color_var){
    # ensure color_var is a factor
    df <- customers_clean %>%
        mutate(!!color_var := as.factor(.data[[color_var]]))
    
    ggtern(data = df, aes_string(x = "PropWebPurchases", y = "PropCatalogPurchases", z = "PropStorePurchases", color = color_var)) +
        geom_point(size = 2, alpha = 0.7) +
        theme_minimal() +
        labs(
          title = paste("Customer Purchase Composition by", color_var),
          x = "Web",
          y = "Catalog",
          z = "Store",
          color = color_var
        ) +
        theme_showgrid()
}

vars_to_plot <- c("Education", "Marital_Status", "Kids_at_Home", "Teens_at_Home")

ternary_plot_list <- list()

# add ternary plot to ternary_plot_list
for (v in vars_to_plot) {
    p <- plot_ternary(v)
    print(p)
    ternary_plot_list[[v]] <- p
}
Warning message:
“`aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.”
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Correlation Plot¶

  • Labelled each cell with its corresponding correlation value.
In [11]:
options(repr.plot.width = 12, repr.plot.height = 12)

# keep only numeric variables
customers_numeric <- customers_clean %>%
  # keep numeric and integer columns
  select(where(is.numeric)) %>% 
  # remove irrelevant columns
  select(-Year_Birth, -TotalPurchases)
  
# create correlation matrices for each shopping channel
corr_web <- cor(customers_numeric, customers_numeric$PropWebPurchases, use = "complete.obs")
corr_catalog <- cor(customers_numeric, customers_numeric$PropCatalogPurchases, use = "complete.obs")
corr_store <- cor(customers_numeric, customers_numeric$PropStorePurchases, use = "complete.obs")

# combine correlations into df
corr_df <- data.frame(
  Variable = rownames(corr_web),
  Web = corr_web,
  Catalog = corr_catalog,
  Store = corr_store
)

# remove the channels themselves from the list
corr_df <- corr_df %>% filter(!Variable %in% c("NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases",
                                               "PropWebPurchases", "PropCatalogPurchases", "PropStorePurchases"))
corr_long <- corr_df %>%
  pivot_longer(cols = c(Web, Catalog, Store),
               names_to = "Channel",
               values_to = "Correlation")

corr_plot <- ggplot(corr_long, aes(y = Channel, x = reorder(Variable, Correlation), fill = Correlation)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(Correlation, 2)), size = 3) +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", limits = c(-1, 1)) +
  labs(title = "Correlation of Customer Traits with Proportion of Purchases via Each Shopping Channel",
       y = "Proportion of Purchases via Shopping Channel", x = "Customer Trait") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), axis.text.y = element_text(size = 10))
corr_plot
No description has been provided for this image

Combining Visualizations¶

  • Unfortunately, combining visualizations via R packages removes the grid lines for my ternary plots, so I will combine .png images of the visualizations.
In [12]:
# folder to save ternary plots
dir.create("ternary_plots", showWarnings = FALSE)
for (name in names(ternary_plot_list)) {
  ggsave(
    filename = paste0("ternary_plots/", name, ".png"), 
    plot = ternary_plot_list[[name]],
    width = 6, height = 6, dpi = 400
  )
}

# saving correlation plot
ggsave("corr_plot.png", corr_plot, width = 10, height = 6, dpi = 300)
In [13]:
options(repr.plot.width = 12, repr.plot.height = 12)

# read ternary plot images
img1 <- rasterGrob(readPNG("ternary_plots/Education.png"), interpolate = TRUE)
img2 <- rasterGrob(readPNG("ternary_plots/Marital_Status.png"), interpolate = TRUE)
img3 <- rasterGrob(readPNG("ternary_plots/Kids_at_Home.png"), interpolate = TRUE)
img4 <- rasterGrob(readPNG("ternary_plots/Teens_at_Home.png"), interpolate = TRUE)

# arrange in a 2x2 grid
tern_plots <- grid.arrange(img1, img2, img3, img4, ncol = 2)
ggsave("tern_plots.png", tern_plots, width = 12, height = 12, dpi = 400)
No description has been provided for this image

Final Visualization¶

  • View final_plots.png for better resolution of this visualization.
In [14]:
options(repr.plot.width = 18, repr.plot.height = 18)

img5 <- rasterGrob(readPNG("tern_plots.png"), interpolate = TRUE)
img6 <- rasterGrob(readPNG("corr_plot.png"), interpolate = TRUE)

final_plots <- grid.arrange(img5, img6, ncol = 1)
ggsave("final_plots.png", final_plots, width = 12, height = 18, dpi = 300)
No description has been provided for this image

Interpretations¶

Ternary Plots¶

  • Visualizes the association between customer demographic variables and the proportion of purchases via each shopping channel: this will help me determine the strength of the associations and which variables to include in my predictive model.
  • Points are not clustered in the center of the plot, showing that customers in this dataset generally bias towards making more purchases in the store and fewer purchases via the catalog.
Coloured by Education: shows the association between customer education and the proportion of purchases via each shopping channel.¶
  • No clear trend or bias in this plot: proportions of purchases appear evenly distributed across channels by education.
Coloured by Marital_Status: shows the association between customer marital status and the proportion of purchases via each shopping channel.¶
  • No clear trend or bias in this plot: proportions of purchases appear evenly distributed across channels by marital status.
Coloured by Kids_at_Home: shows the association between the number of kids the customer has at home and the proportion of purchases via each shopping channel.¶
  • Customers with 1+ kids at home appear to make more web purchases (lower-left section).
Coloured by Teens_at_Home: shows the association between the number of teens the customer has at home and the proportion of purchases via each shopping channel.¶
  • Many more customers with teens at home than kids at home.
  • There is a cluster of customers with 1+ teens at home that appears to make more web purchases (lower-left section), and a group of customers with 1+ teens at home that appears to make fewer store purchases (left section).
Conclusion¶
  • The number of kids/teens at home are likely to be significantly associated with and predictors of the proportion of purchases made via each shopping channel.
  • There are many more customers with teens at home than kids at home, which may influence the strength of the latter predictor.
  • There are very few customers that complained within the past 2 years, so this is unlikely to make a good predictor for our model.

Correlation Plot¶

  • Shows the correlation between numeric variables of customer traits and the proportion of purchases via each shopping channel: this will help me determine the strength of the associations and which variables to include in my predictive model.
Catalog channel:¶
  • Proportion of catalog purchases strongly positively correlated with income, and slightly positively correlated with accepting campaign offers.
  • Proportion of catalog purchases strongly negatively correlated with number of monthly web visits and kids at home.
Store channel:¶
  • Proportion of store purchases slightly positively correlated with kids at home.
  • Proportion of store purchases slightly negatively correlated with income and accepting the last campaign offer.
Web channel:¶
  • Proportion of web purchases strongly positively correlated with number of monthly web visits, and slightly positively correlated with the number of discount purchases and kids at home.
  • Proportion of web purchases slightly negatively correlated with income.
Conclusion¶
  • The number of kids at home, income, monthly web visits, number of discount purchases, and acceptance of campaign offers are likely to be significantly associated with and predictors of the proportion of purchases made via each shopping channel.
    • Specifically the number of kids at home, income, and monthly web visits are likely to be strong predictors.

4. Method and Plan¶

Model: Multinomial Logistic Regression¶

  • Multinomial logistic regression is designed for categorical outcomes with more than two unordered categories (web, catalog, or store), making it ideal for modeling which shopping channel a customer prefers.

Assumptions¶

  • Independence of observations: ie. each customer is independent.
  • Independence of irrelevant alternatives (IIA): the odds of preferring one shopping channel shouldn't change if we remove another channel.
  • No severe multicollinearity among predictors.
  • Linear relationship between predictors and the log-odds of each outcome.
  • All predictors are homoscedastic.

Limitations¶

  • IIA is often unrealistic: shopping channel preferences would change if we removed another channel.
  • By only considering the log-odds of preferring a shopping channel, we lose information about how many purchases the customer would make via each channel.
  • Model coefficients are relative to the store reference category: I will not produce results for web vs. catalog.

5. Computational Code and Output¶

More Wrangling¶

  • Shaped data into long format with Channel and corresponding Count columns.
  • Some numeric predictors are on vastly different scales - in order to easily compare effect sizes across predictors and optimize convergence, I scaled numeric variables.
    • I also tried to generate a model without scaling, and received improbable results: tiny standard errors (eg. 10^-8), huge z-values (eg. 10^7), and p-values of exactly 0 for almost every term.
  • Changed reference category to store channel.
In [15]:
# shaped data into long format with `Channel` and `Count` columns
customers_long <- customers_clean %>%
  pivot_longer(
    cols = c(NumWebPurchases, NumCatalogPurchases, NumStorePurchases),
    names_to = "Channel",
    values_to = "Count"
  ) %>%
  mutate(
    Channel = case_when(
      Channel == "NumWebPurchases" ~ "Web",
      Channel == "NumCatalogPurchases" ~ "Catalog",
      Channel == "NumStorePurchases" ~ "Store"
    ),
    Channel = factor(Channel)
  )

# scaling numeric variables
numeric_cols <- c(
  "Year_Birth", "Income", "Kids_at_Home", "Teens_at_Home",
  "Tenure_Days_Today", "Recency", "Num_Web_Visits_Monthly", "Num_Discount_Purchases"
)
customers_long[numeric_cols] <- scale(customers_long[numeric_cols])

# changing reference category
customers_long$Channel <- relevel(customers_long$Channel, ref = "Store")

Building Model¶

I will start with an additive model: with 17 predictors and a 3-category outcome, I am already estimating a lot of parameters. I will consider interactions after I have reduced my model.

In [16]:
multinom_model <- multinom(
  Channel ~ Year_Birth + Education + Marital_Status + Income + Kids_at_Home + Teens_at_Home +
            Tenure_Days_Today + Recency + Num_Web_Visits_Monthly + Num_Discount_Purchases + Complained_Past_2_Years +
            Accepted_Campaign_Offer_1 + Accepted_Campaign_Offer_2 + Accepted_Campaign_Offer_3 + Accepted_Campaign_Offer_4 +
            Accepted_Campaign_Offer_5 + Accepted_Last_Campaign_Offer,
  data = customers_long,
  weights = Count,
  na.action = na.exclude
)
# weights:  72 (46 variable)
initial  value 30501.871583 
iter  10 value 28622.144543
iter  20 value 28593.365032
iter  30 value 28342.354729
iter  40 value 28281.066403
iter  50 value 28237.776009
final  value 28237.735995 
converged

Results¶

In [17]:
summary_model <- summary(multinom_model)

coefs <- summary_model$coefficients
se    <- summary_model$standard.errors

# calculate z and p values
z <- coefs / se
p <- 2 * (1 - pnorm(abs(z)))

# calculate 95% CI
lower_ci <- coefs - 1.96 * se
upper_ci <- coefs + 1.96 * se

# exponentiate log-odds to odds
odds_ratio      <- exp(coefs)
odds_lower_95   <- exp(lower_ci)
odds_upper_95   <- exp(upper_ci)

results <- data.frame(
  Outcome     = rep(rownames(coefs), each = ncol(coefs)),
  Predictor   = rep(colnames(coefs), times = nrow(coefs)),
  Coefficient = as.vector(coefs),
  Std_Error   = as.vector(se),
  z           = as.vector(z),
  p           = as.vector(p),
  Odds_Ratio  = as.vector(odds_ratio),
  OR_Lower_95 = as.vector(odds_lower_95),
  OR_Upper_95 = as.vector(odds_upper_95)
)

results_sorted <- results %>% arrange(Outcome, p)
results_sorted
A data.frame: 46 × 9
OutcomePredictorCoefficientStd_ErrorzpOdds_RatioOR_Lower_95OR_Upper_95
<chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
CatalogAccepted_Campaign_Offer_2 0.29741342320.03056683 9.729940200.000000e+001.34637181.26807811.4294995
CatalogAccepted_Campaign_Offer_4 -0.28061705790.02774987-10.112374410.000000e+000.75531750.71533320.7975368
CatalogAccepted_Last_Campaign_Offer-0.13199047090.02102694 -6.277208633.447054e-100.87634930.84096670.9132207
Catalog(Intercept) -1.39178968200.22266965 -6.250468734.092227e-100.24862990.16069900.3846747
CatalogAccepted_Campaign_Offer_3 0.14932889880.02679034 5.573983382.489795e-081.16105481.10166191.2236497
CatalogAccepted_Campaign_Offer_5 -0.06741963650.01969200 -3.423706786.177323e-040.93480280.89941030.9715881
CatalogYear_Birth -0.42862583310.13888884 -3.086106992.027958e-030.65140360.49616440.8552139
CatalogEducationGraduation -0.03614493530.01675022 -2.157878173.093731e-020.96450050.93334970.9966910
CatalogEducationMaster -0.02637023870.01531569 -1.721779538.510946e-020.97397440.94517141.0036552
CatalogNum_Web_Visits_Monthly 0.05676590200.05977403 0.949674983.422774e-011.05840800.94139621.1899639
CatalogNum_Discount_Purchases 0.04324855240.05052680 0.855952653.920240e-011.04419740.94574341.1529007
CatalogComplained_Past_2_Years 0.07347446440.09443291 0.778059924.365337e-011.07624110.89438941.2950677
CatalogEducationPhD 0.15263760770.21933475 0.695911644.864842e-011.16490280.75785861.7905695
CatalogMarital_StatusWidow 0.15179376820.22126740 0.686019584.927008e-011.16392020.75435641.7958490
CatalogKids_at_Home 0.02186057200.05711706 0.382732767.019179e-011.02210130.91385001.1431756
CatalogMarital_StatusSingle 0.07891260030.22046223 0.357941587.203870e-011.08210970.70244131.6669883
CatalogMarital_StatusMarried 0.03650967900.13432939 0.271792197.857818e-011.03718430.79709941.3495825
CatalogIncome 0.02232535330.13685205 0.163134958.704122e-011.02257640.78199681.3371699
CatalogRecency -0.00610894070.05258623 -0.116169969.075178e-010.99390970.89657091.1018164
CatalogTenure_Days_Today -0.00651278310.06159747 -0.105731359.157955e-010.99350840.88051901.1209968
CatalogTeens_at_Home 0.00387965890.04812035 0.080624089.357409e-011.00388720.91353261.1031785
CatalogMarital_StatusTogether -0.01058863620.13566617 -0.078049209.377889e-010.98946720.75843791.2908708
CatalogAccepted_Campaign_Offer_1 -0.00573640360.08298537 -0.069125489.448897e-010.99428000.84502611.1698961
Web Marital_StatusSingle 0.33206610300.02220042 14.957648580.000000e+001.39384501.33449541.4558340
Web Marital_StatusTogether 0.21060064940.01913288 11.007263900.000000e+001.23441931.18898521.2815895
Web Num_Discount_Purchases 0.53649949700.06158035 8.712186900.000000e+001.71001051.51558581.9293767
Web Marital_StatusMarried -0.21324618340.02588287 -8.238892622.220446e-160.80795720.76799150.8500027
Web Accepted_Campaign_Offer_5 0.26477973250.05033178 5.260686571.435185e-071.30314391.18072591.4382542
Web Accepted_Last_Campaign_Offer 0.19133101730.04478744 4.271979751.937452e-051.21086021.10909861.3219586
Web Year_Birth 0.07267968830.01870511 3.885552831.020973e-041.07538601.03667421.1155435
Web EducationGraduation -0.06044302360.01611743 -3.750164591.767185e-040.94134740.91207490.9715594
Web Num_Web_Visits_Monthly -0.38878597710.11381976 -3.415803826.359404e-040.67787930.54233430.8473010
Web EducationMaster 0.05116655150.01657048 3.087813682.016348e-031.05249821.01886411.0872425
Web Complained_Past_2_Years 0.12374547450.05632362 2.197044092.801730e-021.13172781.01344071.2638211
Web Marital_StatusWidow 0.03577057040.01643498 2.176490542.951860e-021.03641801.00356441.0703472
Web Accepted_Campaign_Offer_3 -0.11855440810.05771746 -2.054047543.997109e-020.88820350.79319940.9945865
Web Accepted_Campaign_Offer_2 -0.08639027300.05199626 -1.661470959.661890e-020.91723620.82836371.0156435
Web Teens_at_Home 0.09072461770.05711035 1.588584431.121542e-011.09496740.97901171.2246571
Web EducationPhD 0.01861968450.01427820 1.304063621.922119e-011.01879410.99067811.0477080
Web Accepted_Campaign_Offer_4 0.02378665570.05695065 0.417671356.761874e-011.02407180.91591051.1450061
Web Income -0.07292816390.18537249 -0.393414186.940136e-010.92966760.64645081.3369647
Web Tenure_Days_Today 0.02166940660.05584625 0.388019046.980019e-011.02190590.91595391.1401138
Web Kids_at_Home -0.04375536810.15181292 -0.288219007.731791e-010.95718810.71083941.2889114
Web Recency -0.01185944400.11488785 -0.103226279.177834e-010.98821060.78896001.2377816
Web Accepted_Campaign_Offer_1 0.00450467770.05861134 0.076856769.387375e-011.00451480.89549961.1268013
Web (Intercept) 0.00019662150.01718630 0.011440599.908719e-011.00019660.96706601.0344623

Interpretation¶

Choosing catalog over store shopping:¶

At a significance level of alpha = 0.05:

  • Customers that are younger are 35% less likely, graduated customers (relative to basic education) are 4% less likely, customers that accepted campaign offer 2, 3, 4, 5, and the last offer are 35% more likely, 16% more likely, 25% less likely, 7% less likely, and 12% less likely, respectively, to choose catalog over store shopping.

Choosing web over store shopping:¶

At a significance level of alpha = 0.001 (reporting most significant predictors):

  • Customers that are single were 39% more likely, common-law are 23% more likely, married are 19% less likely, younger are 7% more likely, customers with more discount purchases are 71% more likely, customers with more web visits are 32% less likely, customers with a Master's are 5% more likely, and graduated customers are 6% less likely to choose web over store shopping.
  • Customers that accepted campaign offer 5 and the last offer are 30% and 21% more likely, respectively, to choose web over store shopping.

Summary¶

  • It appears that a preference for catalog shopping is associated with customers that are older and particularly sensitive to campaigns 2 (positive), 3 (positive), and 4 (negative).
  • Preference for shopping on the web is associated with customers that are younger, not married, and make more discount purchases (deal-seekers).
  • Unexpectedly, more monthly web visits are associated with lowers odds of preferring web shopping - these customers may be browsing but not buying over the web.